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In this paper we study the problem of learning phylogenies and 
r—{ | hidden Markov models. We call a Markov model nonsingular if all 

h— j . transition matrices have determinants bounded away from (and 1). 

' We highlight the role of the nonsingularity condition for the learning 

problem. Learning hidden Markov models without the nonsingularity 
i i condition is at least as hard as learning parity with noise, a well- 

' known learning problem conjectured to be computationally hard. On 

, the other hand, we give a polynomial-time algorithm for learning 

^ ■ nonsingular phylogenies and hidden Markov models. 

1. Introduction. In this paper we consider the problem of learning phy- 
<N ■ lo genies and hidden Markov models, two of the most popular Markov models 

■ used in applications. 

t^. , Phylogenies are used in evolutionary biology to model the stochastic evo- 

lution of genetic data on the ancestral tree relating a group of species. More 

■ precisely, the leaves of the tree correspond to (known) extant species. Inter- 
l/-) \ nal nodes represent extinct species, while the root of the tree represents the 

most recent ancestor to all species in the tree. Following paths from the root 
to the leaves, each bifurcation indicates a speciation event whereby two new 
species are created from a parent. 

The underlying assumption is that genetic information evolves from the 
root to the leaves according to a Markov model on the tree. This genetic 
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information may consist of DNA sequences, proteins, and so on. Suppose, 
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for example, that the genetic data consists of (aligned) DNA sequences and 
consider the evolution of the first letter in all sequences. This collection, 
named the first character, evolves according to Markov transition matrices 
on the edges. The root is assigned one of the four letters A, C, G and T. Then 
this letter evolves from parents to descendants according to the Markov 
matrices on the edges connecting them. 

The map from each node of the tree to the ith letter of the corresponding 
sequence is called the ith character. It is further assumed that the characters 
are i.i.d. random variables. In other words, each site in a DNA sequence 
is assumed to mutate independently from its neighbors according to the 
same mutation mechanism. Naturally, this is an over-simplification of the 
underlying biology. Nonetheless, the model above may be a good model for 
the evolution of some DNA subsequences and is the most popular evolution 
model in molecular biology, see, for example, [12]. 

One of the major tasks in molecular biology, the reconstruction of phylo- 
genetic trees, is to infer the topology of the (unknown) tree from the charac- 
ters (sequences) at the leaves (extant species). Often one is also interested 
in inferring the Markov matrices on the edges. In this paper we pay spe- 
cial attention to computational issues arising from the problem of inferring 
the complete Markov model. We seek to design efficient reconstruction algo- 
rithms or provide evidence that such algorithms do not exist. Here, efficiency 
means both that the length of the sequences used is polynomial in the num- 
ber of leaves — that is, information-theoretic efficiency — and that the number 
of elementary steps performed by the reconstruction algorithm is polynomial 
in the number of leaves — that is, complexity-theoretic efficiency. The main 
technique for proving that a computational problem does not admit an effi- 
cient algorithm is to show that a well-known hard problem is a special case 
of this new problem. This is called a reduction, the most common of which 
is an NP-hardness reduction [16]. Let n be the number of leaves; then the 
fact that a sequence of length Q(n c ), with c > 0, is needed is established 
in [25]. 

In the systematics and statistics literature, three main approaches have 
been studied in depth for the reconstruction problem: parsimony, maximum 
likelihood and distance-based methods (see, e.g., [12, 30] for a detailed review 
and a thorough bibliography). Parsimony is known to be inconsistent [15] 
(it may converge to the wrong tree even if the number of characters tends to 
infinity) and NP-hard [18]. Maximum likelihood is also NP-hard [6, 29], but 
it is consistent [5]. As for distance-based methods, they can be consistent 
and, furthermore, run in polynomial time [10] (under some assumptions, see 
below). However, these methods have not gained popularity in biology yet 
(see [28]). 

Much work has been devoted to the reconstruction of phylogenies in the 
learning setting [2, 7, 13]. In particular, in [7] a polynomial-time algorithm 
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is obtained for Markov models on 2-state spaces in complete generality (in 
particular, there are no assumptions of regularity of the Markov transition 
matrices). Note that the authors of [7] also conjecture that their technique 
extends to the general /e-state model, but then restrict themselves to k = 2. 

One may roughly summarize the different approaches to phylogenetic re- 
construction as follows: 

• In biology the interest is usually in reconstructing the tree topology, as 
well as the full Markov model. However, most work in biology deals with 
special reconstruction methods and there are no results on efficient recon- 
struction. 

• Work in combinatorial phylogeny focuses on efficient reconstruction of 
the tree topology. Here the Markov mutation model on the tree is not 
reconstructed. 

• Work in learning theory shows that tree topologies and Markov models 
can be efficiently reconstructed when the number of character states is 
2. In this paper we discuss the problem of recovering trees and Markov 
models when the number of character states is more than 2. 

The framework of Markov models on trees has several special cases that 
are of independent interest. The case of mixtures of product distributions 
is discussed in [14]. Arguably, the most interesting special case is that of 
learning hidden Markov models (HMMs). HMMs play a crucial role in many 
areas from speech recognition to biology; see, for example, [8, 27]. It is easy to 
encode an HMM as a Markov model on a caterpillar tree. See Figure 1, where 
the states on the top line correspond to the hidden states and the states 
at the bottom correspond to observed states. The arrows going downward 
correspond to functions applied to the hidden states. 

In [1] it is shown that finding the "optimal" HMM for an arbitrary distri- 
bution is hard unless RP = NP (it is widely believed that RP / NP). See 
also [24] where hardness of approximation results are obtained for problems 
such as comparing two hidden Markov models. Most relevant to our setting 
is the conjecture made in [22] that learning parity with noise is computa- 
tionally hard. It is easy to see that the problem of learning parity with noise 
may be encoded as learning an HMM over 4 states. See Section 1.3. 

There is an interesting discrepancy between the two viewpoints taken in 
works concerning learning phylogenies and works concerning learning hidden 
Markov models. The results in phylogeny are mostly positive — they give 
polynomial-time algorithms for learning. On the other hand, the results 
concerning HMMs are mostly negative. 

This paper tries to resolve the discrepancy between the two points of 
view by pointing to the source of hardness in the learning problem. Roughly 
speaking, it seems like the source of hardness for learning phylogenies and 
hidden Markov models are transition matrices P such that detP is (or 
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close to 0) but rankP > 1 (or P is far from a rankl matrix). Note that, 
in the case k = 2, there are no matrices whose determinant is and whose 
rank is more than 1. Indeed, in this case, the problem is not hard [7]. We 
note, furthermore, that in the problem of learning parity with noise all of 
the determinants are and all the ranks are greater than 1. 

The main technical contribution of this paper is to show that the learning 
problem is feasible once all the matrices have (5 < |detP| < 1 — l/poly(n) 
for some (3 > 0. We thus present a proper PAC learning algorithm for this 
case. In the case of hidden Markov models we prove that the model can be 
learned under the weaker condition that l/poly(n) < | detP| < 1. Assuming 
that learning parity with noise is indeed hard, this is an optimal result. See 
Section 1.3. 

The learning algorithms we present are based on a combination of tech- 
niques from phylogeny, statistics, combinatorics and probability. We believe 
that these algorithms may also be extended to cases where | detP| is close 
to 1 and, furthermore, to cases where if | deti-*| is small, then the matrix P 
is close to a rank 1 matrix, thus, recovering the results of [7]. 

Interestingly, to prove our result, we use and extend several previous re- 
sults from combinatorial phylogeny and statistics. The topology of the tree is 
learned via variants of combinatorial results proved in phylogeny [10]. Thus, 
the main technical challenge is to learn the mutation matrices along the 
edges. For this, we follow and extend the approach developed in statistics 
by Chang [5]. Chang's results allow the recovery of the mutation matrices 
from an infinite number of samples. The reconstruction of the mutation ma- 
trices from a polynomial number of samples requires a delicate error analysis 
along with various combinatorial and algorithmic ideas. 

The algorithm is sketched in Section 2 and the error analysis is detailed 
in Section 3. 

1.1. Definitions and results. We let T^{n) denote the set of all trees 
on n labeled leaves where all internal degrees are exactly 3. Note that if 
T = (V,£) £ T3(n), then |V| = In — 2. We will sometimes omit n from the 
notation. Below we will always assume that the leaf set is labeled by the 
set [n] = {1, . . . , n}. We also denote the leaf set by C. Two trees T\,T2 are 
considered identical if there is a graph isomorphism between them that is 
the identity map on the set of leaves [n]. We define a caterpillar to be a tree 
on n leaves with the following property: the subtree induced by the internal 
nodes is a path (and all internal vertices have degree at least 3). See Figure 1 
for an example. We let TCs(n) denote the set of all caterpillars on n labeled 
leaves. 

In a Markov model T on a (undirected) tree T = (V,S) rooted at r, each 
vertex iteratively chooses its state given the one of its parent by an appli- 
cation of a Markov transition rule. Consider the orientation of £ where all 
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edges are directed away from the root. We note this set of directed edges £ r . 
Then the probability distribution on the tree is the probability distribution 
on given by 

(1) * T (s)=^(s(r)) J] P^ )s(v) , 

where s£C v , C is a finite state space, P uv is the transition matrix for 
edge (u,v) G £ r and irj is the distribution at the root. We let k = \C\. We 
write TTyy for the marginals of ir T on the set VV. Since the set of leaves is 
labeled by [n], the marginal ttT-, is the marginal on the set of leaves. We will 
often remove the superscript T. Furthermore, for two vertices u, v G V, we 
let P-j 1 =F[s(v) = j\s(u) =i]. We will be mostly interested in nonsingular 
Markov models. 

Definition 1. We say that a Markov model on a tree T = (V,£) is 
nonsingular [{(5 , (5' , o~)-nonsingular] if we have the following: 

I. For all e€Sr, it holds that 1 > | det P e \ > [1 - (3' > | det P e \ > (3] and 
II. For all v G V, it holds that ir v (i) > \ir v (i) > a] for all i in C. 

It is well known [31] that if the model is nonsingular, then, for each w G V, 
one can write 

(2) 7T T (S)=^(S(W)) J] 

where now all edges (u,v) are oriented away from w. In other words, the 
tree may be rooted arbitrarily. Indeed, in the learning algorithms discussed 
below, we will root the tree arbitrarily. We will actually refer to £ as the set 
of directed edges formed by taking the two orientations of all (undirected) 
edges in the tree. It is easy to show that (/?,/?', cr)-nonsingularity as stated 
above also implies that property I holds for all (u,v) G £ with appropriate 
values of (3, a. 

Transition matrices P satisfying |detP| = 1 are permutation matrices. 
While edges equipped with such matrices preserve information, it is impos- 
sible to deduce the existence of such edges from the phylogenetic data. For 
example, if all edges satisfy that P is the identity matrix, then the characters 
are always constant for all possible trees. 

Note, moreover, that if | detP™| > for all edges (u,v) and for all v G V, 
the distribution of s(v) is supported on at most \C\ — 1 elements, then one 
can redefine the model by allowing only \C\ — 1 values of s(v) at each node 
and deleting the corresponding rows and columns from the transition ma- 
trices P e . (Note that the labels of the character states at internal nodes are 
in fact determined only up to a permutation and similarly for the mutation 
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matrices. This will be explained in more detail below.) Thus, condition II is 
very natural given condition I. 

We call a Markov model as in (1) a phylogenetic tree. Given collections 
M n of mutation matrices P and collections T(n) of trees on n leaves, we let 
T(n) ® M n denote all phylogenetic trees of the form (1), where T G T(n) 
and P e £ M n for all e. Given numbers < a n < 1, we write (T(n) ® M n , a n ) 
for all the elements of T^^n) ® M n satisfying n v (i) > a n for all i £ C and 
w € V. We will be particularly interested in the collections of all binary 
trees on n leaves denoted T3(n) and in the collections of all caterpillars 
on n leaves denoted TC3(n). Our goal is to provide efficient algorithms to 
infer the models Ts(n) ® M n and TCs(n) ® M n , given independent samples 
of the characters at the leaves. However, given any finite amount of data, 
one cannot hope to estimate exactly with probability 1 the tree and the 
transition matrices. Moreover, some degrees of freedom, such as the labels 
of the characters at the internal nodes, cannot be recovered even from the 
exact distribution at the leaves. 

Since the model cannot be recovered exactly, an alternative approach is 
needed. The standard approach in computational learning theory is to use 
the PAC learning framework introduced by [32], here in its variant proposed 
by [22]. PAC learning has been studied extensively in the learning theory 
literature. For background and references, see [23]. 

Let e > denote an approximation parameter, 5 > 0, a confidence param- 
eter, (M n ), collections of matrices, (T(n)), collections of trees, and (o~ n ), a 
sequence of positive numbers. Then we say that an algorithm A PAC-learns 
(T(n) ® M n , a n ) if for all n and all T G (T(n) ® M n , o~ n ), given access to in- 
dependent samples from the measure ir^ , A outputs a phylogenetic tree T' 

such that the total variation distance between 7r7\ and ttT\ is smaller than e 
with probability at least 1 — 5 and the running time of A is poly(n, 1/5,1/e). 
In our main result we prove the following. 

Theorem 1. For every constant /3,K8,Kn > and every finite set C, 
the collection of (P,n~ KlB ,n~ K7T )-nonsingular Markov phylogenetic models is 
PAC-learnable. More formally, let C be a finite set, P,K/3,K n > 0. Let M n 
denote the collection of all \C\ x \C\ transition matrices P, where l — n~ K P > 
| detP| > (3. Then there exists a PAC-learning algorithm for (T^(n) ® M n , 
n~ Kn ) whose running time is poly(n, \C\, 1/e, 1/5). 

Furthermore, if the learning problem has an additional input which is 
the true tree topology, then the assumption on determinants in M n can be 
relaxed to 1 > \ det P\ > (3. 

For hidden Markov models, we can prove more. 
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Theorem 2. Let (f>d,K n >0 be constants. Let C be a finite set and M n 
denote the collection of \C\ x \C\ transition matrices P, where 1 > jdetPj > 
n -<t>d _ Then there exists a PAC-learning algorithm for (TC3(n)(S>M n ,n _K,r ). 
The running time and sample complexity of the algorithm is poly(ro, |C|, 
l/e,\/S). 

In many applications of HMMs, the state spaces at different vertices are of 
different sizes and, therefore, many of the Markov matrices have determi- 
nant. Theorem 2 is not applicable in these cases. Indeed, then, the negative 
result presented in Section 1.3 may be more relevant. 

1.2. Inferring the topology. We let the topology of T denote the under- 
lying tree T = (V,£). The task at hand can be divided into two natural 
subproblems. First, the topology of T needs to be recovered with high prob- 
ability. Second, the transition matrices have to be estimated. Reconstructing 
the topology has been a major task in phylogeny. It follows from [10, 11] 
that the topology can be recovered with high probability using a polynomial 
number of samples. Here is one formulation from [26]. 

Theorem 3. Let (3 > 0, Kr > and suppose that M n consists of all 
matrices P satisfying (3 < |detP| < 1 — n~ K f. For all kt > 2, the topol- 
ogy of T G (Ts(n) <8> M n ,n _K,r ) can be recovered in polynomial time using 
n O{i/ p+Kp+HT+K^) sam pi es yjitfo probability at least 1 — n 2 ~ KT . 

We will also need a stronger result that applies only to hidden Markov 
models. The proof, which is sketched in the Appendix, is quite similar to 
the proofs in [10, 11]. 

Theorem 4. Let £ > 0, > and suppose that M„ consists of all 
matrices P satisfying < |detP| < 1. Then for all 6 > 0,r > 0, and 
all T G (TC3(n) (8> ~M. n ,n~ Kw ), one can recover from n °((+ d + T + K K) sam pl es 
a tree-topology T with probability 1 — n~ e , where the topology T satis- 
fies the following. It is obtained from the true topology T by contracting 
some of the internal edges whose corresponding mutation matrices P satisfy 
| det P\ > 1 — n~ T . Note that T may have some of its internal degrees greater 
than 3. 

1.3. Hardness of learning singular models. We now briefly explain why 
hardness of learning "parity with noise" implies that learning singular hidden 
Markov models is hard. We first define the parity-learning problem, which 
has been extensively studied in the computational learning theory. See, in 
particular, [3, 4, 19, 21]. 
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Definition 2 (Learning parity with noise). Let x = (x±, . . . ,x n ) be a 
vector in {0, l} n , T a subset of {1, ... ,71} and < a < 1/2. The parity of 
x on T is the Boolean function, denoted </>t(x) =0j g T x «> which outputs 
if the number of ones in the subvector {xijigr is even and otherwise. 
A uniform query oracle for this problem is a randomized algorithm that 
returns a random uniform vector x, as well as a noisy classification /(x) 
which is equal to 4>t( x ) with probability a > and 1 — <^>t( x ) otherwise. All 
examples returned by the oracle are independent. The learning parity with 
noise problem consists in designing an algorithm with access to the oracle 
such that, for all e, 5 > 0, the algorithm returns a function h : {0, l} n — ► {0, 1} 
satisfying P x [/t(x) = 0t( x )] > 1 — £ (where x is uniform over {0,1}™) with 
probability at least 1 — 5 in time polynomial in n, 1/e, 1/6. 

Kearns' work [21] on the statistical query model leads to the following 
conjecture. 

Conjecture 1 (Noisy parity assumption [21]). There is an a with < 
a < 1/2 such that there is no efficient algorithm for learning parity under 
uniform distribution in the PAC framework with classification noise a. 

In [22] , this is used to show that learning probabilistic finite automata with 
an evaluator is hard. It is easy to see that the same construction works with 
the probabilistic finite automata replaced by an equivalent hidden Markov 
model (HMM) with 4 states (this is a special case of our evolutionary tree 
model when the tree is a caterpillar) . The proof, which is briefly sketched in 
Figure 1, is left to the reader. We remark that all matrices in the construction 
have determinant and rank 2. Note, moreover, that, by a standard coupling 
argument, it follows that if for all edges (u,v) we replace the matrix P uv 
by the matrix (1 — n~ T )P uv + n~ T I, then the model given in Figure 1 and 
its variant induces undistinguishable distributions on K samples if _K" < 
o(n T_1 ). This shows that, assuming that learning parity with noise is hard, 
Theorem 2 is tight up to the constant in the power of n. 

2. Overview of the algorithm. 

2.1. Chang's spectral technique. One of the main ingredients of the al- 
gorithm is the following important result due to Chang [5] that we rederive 
here for completeness. Let T be a 4-node (star) tree with a root r and 3 
leaves a,b,c. (See Figure 2.) Let P uv be the transition matrix between ver- 
tices u and v, that is, Pf!j v = F[s(v) = j\s(u) = i] for all i,j £ C. Fix 7 G C. 
Then by the Markov property, for all i,j G C, 

F[s(c) = 7, a(b) = j\8(a) =i] = Yl P?kP r h C ,P$, 
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Fig. 1. Hidden Markov model for noisy parity. The model computes N ®Q) iGT Xi, where 
the Xi's are uniform over {0, 1}, T is a subset of {I, . . . ,n}, and N is a small random noise. 
The Si = @ j eT j Ki Xj are the partial sums over variables included in T. The observed 
nodes are in light gray. The hardness proof follows from a standard reduction technique 
similar to that used in [22] . 



or, in matrix form, P ab <"t = P ar diag(P.! y c )P r , where the matrix P a6 >T is de- 
fined by 

^,7 = p [s(c)=7)S(6)=i | s(a)= ^ 
for all i,j e C. Then, noting that (P" 6 )" 1 = (P r6 )- 1 (P ar )- 1 , we have 

(*) pab, 7 ,pabyl = par diag (p^) (par)-l 9 

assuming the transition matrices are invertible. Equation (*) is an eigen- 
value decomposition where the l.h.s. involves only the distribution at the 
leaves. Therefore, given the distribution at the leaves, we can recover from 
(*) the columns of P ar (up to scaling), provided the eigenvalues are distinct. 
Note that the above reasoning applies when the edges (r, a), (r,b), (r, c) are 
replaced by paths. Therefore, loosely speaking, in order to recover an edge 
(w,w'), one can use (*) on star subtrees with w and w' as roots to ob- 
tain P aw and P aw ', and then compute P ww ' = (p<™)-lp<™'. i n [5], under 
further assumptions on the structure of the transition matrices, the above 
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Fig. 2. Star tree with three leaves. 
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scheme is used to prove the identifiability of the full model, that is, that the 
output distributions on triples of leaves uniquely determine the transition 
matrices. In this paper we show that the transition matrices can actually be 
approximately recovered using (★) with a polynomial number of samples. 

There are many challenges in extending Chang's identifiability result to 
our efficient reconstruction claim. First, as noted above, equation (★) uncov- 
ers only the columns of P ar . The leaves actually give no information on the 
labelings of the internal nodes. To resolve this issue, Chang assumes that the 
transition matrices come in a canonical form that allows to reconstruct them 
once the columns are known. For instance, if in each row, the largest entry is 
always the diagonal one, this can obviously be performed. This assumption 
is a strong and unnatural restriction on the model we wish to learn and, 
therefore, we seek to avoid it. The point is that relabeling all internal nodes 
does not affect the output distribution, and, therefore, the internal labelings 
can be made arbitrarily (in the PAC setting). The issue that arises is that 
those arbitrary labelings have to be made consistently over all edges sharing 
a node. Another major issue is that the leaf distributions are known only 
approximately through sampling. This requires a delicate error analysis and 
many tricks which are detailed in Section 3. The two previous problems are 
actually competing. Indeed, one way to solve the consistency issue is to fix 
a reference leaf u and do all computations with respect to the reference leaf, 
that is, choose a = u> in every spectral decomposition. However, this will 
necessitate the use of long paths on which the error builds up exponentially. 
Our solution is to partition the tree into smaller subtrees, reconstruct con- 
sistently the subtrees using one of their leaves as a reference, and patch up 
the subtrees by fixing the connecting edges properly afterward. We refer to 
the connecting edges as separators. 

A detailed version of the algorithm, FullRecon, including two subrou- 
tines, appears in Figures 3, 4 and 5. The correctness of the algorithm uses 
the error analysis and is therefore left for Section 3. The two subroutines 
are described next. 

2.2. Subtree reconstruction and patching. We need the following notation 
to describe the subroutines. If e = (u,v), let d e (u) be the length of the 
shortest path (in number of edges) from u to a leaf in C not using edge e. 
Then the depth of T is 

A= max {max{ d e (u), d e (v)}}. 
e=(u,«)e£ 

It is easy to argue that A = O(logn). (See Section 3.) Also, for a set of 
vertices W and edges S, denote Af(W,S) the set of nodes not in W that 
share an edge in £ \S with a node in W ( "outside neighbors" of W "without 
using edges in 5"). Let be the subset of nodes in V at distance at most 
A from leaf a. 
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Algorithm FullRecon 

Input: t ret 1 topology; independent samples at leaves; 
Output: estimated mutation matrices {P e } eE £; 

• Phase 0: Initialization 

- Let w be any leaf in £, and initialize the set of uncovered leaves 
U := 0; Set t := and a v := u>; 

- Initialize the separator set S to 0; 

• Phase 1: Subtree Reconstruction 

- While U^C, 

* Perform LeafRecOn(o.j ) ; set t := t + .1; 

• Phase 2: Patching Up 

- Let T be the last value taken by t above; 

- For t := 1 to T, perform SepRecon(« ( , (w t , w' t ),a' t ) [the variables 
at,a' t ,wt,w' t are computed by LeafRecon]. 

• Phase net Normalization 

- For all edges (w,w'), compute the final (normalized) estimate p ww> 
using (6) in Section 3. 

Fig. 3. Algorithm FullRecon. 

Subroutine LeafRecon. The subtree reconstruction phase is performed 
by the algorithm LeafRecon depicted in Figure 6. The purpose of the sub- 
routine LeafRecon is to partition the tree into subtrees, each of which 
has the property that all its nodes are at distance at most A from one of 
its leaves (same leaf for all nodes in the subtree), which we refer to as the 
reference leaf of the subtree. The correctness of the algorithm, proved in 
Section 3, thus establishes the existence of such a partition. This partition 
serves our purposes because it allows (1) to reconstruct mutation matrices in 
a consistent way (in each subtree) using reference leaves, and (2) to control 
the building up of the error by using short paths to the reference leaf. The 
matrix reconstruction is performed simultaneously by LeafRecon, as the 
partition is built. At the call of LeafRecon, we consider the subgraph T 
of T where edges previously labeled as separators have been removed. We 
are given a reference leaf a and restrict ourselves further to the (connected) 
subtree T a of T consisting of nodes at distance at most A from a. Mov- 
ing away from a, we recover edge by edge the mutation matrices in T a by 
Chang's spectral decomposition. At this point, it is crucial that (1) we use 
the transition matrices previously computed to ensure consistency in the 
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Algorithm LeafK egos 

Input: tree topology; reference leaf a; uncovered leaves U\ separator set S; 
independent samples at the leaves; 

Output: estimated mutation matrices P e and estimated node distributions tt u 
on edges and vertices in subtree associated to a; 

• Set W := {«}, U :=Uu{a}; 

• While N(W,S) n 

- Pick any vertex r € S/(W, 5)nSf; 

- Let (rQ,r) be the edge with endpoint r in the path from a to r; 

- If ro = a, set P aT ° to be the identity, otherwise P aT ° is known from 
previous computations; 

- If r is not a leaf, 

* Let (r, 7*1) and (r, T2) be the other two edges out of r; 

* Find the closest leaf b (resp. c) connected to t\ (resp. rj) by a 
path not using {r*o,r); 

* Draw Gaussian vector U; Perform the eigenvalue decomposition 
(*') (see Section 3); 

* Build X out of an arbitrary ordering of the columns recovered 
above; 

* Compute 7} = (X)" 1 , and set P ar := Xdiagf^); 

- Otherwise if r is a leaf: 

* Use estimate from leaves for P ar ; 

* SetLi :=UU{r}\ 

- Compute P r ° r := (P a ^)~ l pV'; 

- Compute 7T r = 7r u P° r ; 

- Obtain P rr ° from P n,r , % and f ro using Bayes rule; 

- Set W := WU{r}; 

- If r is not a leaf and the distance between a and r is exactly A, for 
1= 1,2, 

* If {r,r, } is not in «S, set (' ;= \S\ + 1, a ( < := b (if ( = L and 
c o.w.), aj, := a, {u;*' , u;{, ) := (r,.,r), and add {r, r t } (as an 
undirected edge) to S. 

Fig. 4. Algorithm LeafRecon. 

labeling of internal nodes, and that (2) in order to control error we choose 
the leaves b and c (in the notation of the previous subsection) to be at dis- 
tance at most A + 1 from the edge currently reconstructed (which is always 
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Algorithm SepRecon 

Input: leaves a, a' and separator edge (w, w'); 
Output: estimated mutation matrices P ww , P ww ; 

• Estimate P aa '; 

• Compute P ww ' := (po«J-Ij&**'(#«V)-lj 

• Obtain P w w from P mv > , #,„ and fiv using Bayes rule. 

Fig. 5. Algorithm SepRecon. 



possible by definition of A). Note that the paths from the current node to 
b and c need not be in T a . Once T a is reconstructed, edges on the "outside 
boundary" of T a (edges in T with exactly one endpoint in T a ) are added to 
the list of separators, each with a new reference leaf taken from the unex- 
plored part of the tree (at distance at most A). The algorithm LeafRecon 
is then run on those new reference leaves, and so on until the entire tree is 
recovered. (See Figure 6.) The algorithm is given in Figure 4. Some steps 
are detailed in Section 3. We denote estimates with hats, for example, the 
estimate of P ar is P ar . 

Subroutine SepRecon. For its part, the algorithm SepRecon consists 
in taking a separator edge (w,w') along with the leaf a' from which it was 
found and the new reference leaf a it led to, and computing 




v ~ „ ci4 



Fig. 6. Schematic representation of the execution of LeafRecon. The only edges shown 
are separators. 
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where the matrices P aw , P w ' a ' have been computed in the subtree recon- 
struction phase and P aa can be estimated from the data. It will be impor- 
tant in the error analysis that the two leaves a, a' are at distance at most 
A from w,w' respectively. We then use Bayes' rule to compute P w w . See 
Figure 5. 

2.3. Modifications. The previous description of the reconstruction algo- 
rithm is rather informal. Also, we are led to make a few modifications to 
the basic algorithm. Those are described where needed in the course of the 
analysis in the next section. Here is a list of the changes, all of which appear 
in the figures where the corresponding routines are detailed. 

1. In Chang's spectral technique, it is crucial that the eigenvalues in (*) be 
different and actually well separated. Below, we multiply the system (★) 
by random Gaussians and obtain the new system (*'). See "Separation 
of exact eigenvalues." 

2. In (*'), once the eigenvectors are recovered, they have to be normalized 
properly to obtain the estimated transition matrix P ar . This is detailed 
in the subsection "Error on estimated eigenvectors." 

3. All estimated transition matrices have to be made stochastic. This is done 
in subsection "Stochasticity." 

3. Error analysis. As pointed out in the previous section, the distri- 
bution on the leaves is known only approximately through sampling. The 
purpose of this section is to account for the error introduced by this approx- 
imation. 

For W a subset of vertices of T, recall that 7r>v is the joint distribution 
on W. We denote by 7ryv our estimate of 7Tyv obtained by using the estimated 
mutation matrices. For a vertex u, we let n u (-) = P[s(u) = ■]. We denote by 1 
the all-one vector (the size is usually clear from the context). For any vector 
p, we let diag(p) be the diagonal matrix with diagonal p. Recall that for a 
vector x, \\x\\i = J2i\ x i\i an d for a matrix X, \\X\\i = max,,- J2i \ x ij\ ■ Recall 
that [n] stands for the set of leaves. From now on, we assume that the tree is 
known and that the model is 0, n~ KlT )-nonsingular, for (3, K n > constant. 
Theorems 1 and 2 both follow from the following theorem. 

Theorem 5. Assume the tree is known and the model is ((3,0,n~ K7T )- 
nonsingular, for /3,K n >0 constant. For all e, 5 > and n large enough, the 
reconstruction algorithm produces a Markov model satisfying 

\\n[ n ] ~ K[n]\\l < £ ; 

with probability at least 1 — 5. The running time of the algorithm is polyno- 
mial in n, 1/e, 1/5. 
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We can now prove Theorem 1. 

Proof of Theorem 1. First apply Theorem 3 to recover the topology. 
Then apply Theorem 5 to infer the transition matrices. □ 

The proof of Theorem 2 is similar — it uses Theorem 4 instead of Theo- 
rem 3. The proof is given in the Appendix. 

Below we use the expression with high probability (w.h.p.) to mean with 
probability at least 1 — l/poly(n). Likewise, we say negligible to mean at 
most l/poly(n). In both definitions it is implied that poly(n) is 0(n K ) for 
a constant K that can be made as large as one wants if the number of 
samples is 0(n K ) with K' large enough. Standard linear algebra results 
used throughout the analysis can be found, for example, in [20]. 

In the rest of this section, (3, K n and k = \C\ are fixed constants. In par- 
ticular, polynomial factors in (3 and k are dominated by polynomials in n. 

3.1. Approximate spectral argument. In this subsection we address sev- 
eral issues arising from the application of Chang's spectral technique to an 
approximate distribution on the leaves. Our discussion is summarized in 
Proposition 1. We use the notation of Section 2.1. 

Proposition 1. Let a be a leaf and let r be an internal vertex at distance 
at most A from a. Then there exists a relabeling of the states at r so that 
the estimate P ar recovered from (*) using poly(n) samples is such that the 
error \\P ar — P ar ||i is negligible w.h.p. 

The relabeling issue mentioned in Proposition 1 will be tackled in Sec- 
tion 3.2. 

Determinants on paths. The estimation error depends on the determi- 
nant of the transition matrices involved. Since we use Chang's spectral tech- 
nique where a — > r, r -^b and r — ► c are paths rather than edges, we need a 
lower bound on transition matrices over paths. This is where the use of short 
paths is important. Multiplicativity of determinants gives immediately that 
all determinants of transition matrices on paths of length 0(A) are at least 
l/poly(re). 

Lemma 1 (Bound on depth). The depth A of any full binary tree is 
bounded above by log 2 n + 1 . 

Proof. Because the tree is full, the inequality A > d implies that there 
is an edge on one side of which there is a complete binary subtree of depth 
d. Since there are only n vertices in the tree, we must have A < log 2 n + 1. 
□ 
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Lemma 2 (Determinants on paths). Fix 9 > — 21og 2 /? constant. Let a,b 
be vertices at distance at most 2A + 1 from each other. Then, under the 
(P,0,n~ K ^)-nonsingularity assumption, the transition matrix between a and 
b satisfies | det[P aii ]| > n~ e for n large enough. 

Proof. This follows from the observation that P ab is the product of the 
mutation matrices on the path between a and b. Every matrix has its deter- 
minant at least (3 (in absolute value) . By the multiplicativity of determinants 
and Lemma 1, 

| det[P a6 ]| > P 2A+1 > /3 21 °g2™+ 3 > n~ e , 
for re large enough. □ 

Error on leaf distributions. The algorithm estimates leaf distributions 
through sampling. We need to bound the error introduced by sampling. Let 
a, b, c be leaves at distance at most 2A + 1 from each other and consider the 
eigenvalue decomposition (★). We estimate P ab by taking poly (re) samples 
and computing 

N a ' b 

pab _ M 

y N a ' 

for i,j G C, where Nf is the number of occurrences of s(a) = i and N^'j is 

the number of occurrences of s(a) = i,s(b) = j. Likewise, for -P" 6 ' 7 , we use 
poly(re) samples and compute the estimate 

AT<2,b,C 

pab,-/ _ 
ij N a ' 

i 

where N^p^ is the number of occurrences of s(a) = i, s(b) = j, s(c) = 7. 
We also bound the error on the 1-leaf distributions; this will be used in 
the next subsection. We use poly(re) samples to estimate 7r a using empiri- 
cal frequencies. Standard concentration inequalities give that \\P ab — P ab \\±, 
\\P ab >~f — _P ab '">'|| 1 , and ||7r a — 7r a ||i are negligible w.h.p. 

Lemma 3 (Sampling error). For all e,p>0, there is an s > such that 
if the number of samples is greater than n s , then, with probability at least 
1 — l/n p , the estimation error on the matrices P ab and P ab ^ satisfies 

\\p ab — P ab \\i < — ||p ab >7 _ P^'TIL < — 

— n e ' — re e ' 

for all a,b£ £ and 7 G C, and the estimation error on the leaf distributions 
satisfies 

1 
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for all a 6 C, provided n is large enough. 

Proof. Note that the nonsingularity assumption ensures that, for each 
leaf a and state i, if one uses a sample size n s with s large enough, then 
w.h.p. there will be poly(n) samples where s(a) = i. The bounds then follow 
from Hoeffding's inequality. □ 

Separation of exact eigenvalues. In Section 2 it was noted that the eigen- 
values in (*) need to be distinct to guarantee that all eigenspaces have di- 
mension 1. This is clearly necessary to recover the columns of the transition 
matrix P ar . When taking into account the error introduced by sampling, 
we actually need more. From standard results on eigenvector sensitivity, it 
follows that we want the eigenvalues to be well separated. A polynomially 
small separation will be enough for our purposes. We accomplish this by 
using a variant of an idea of Chang [5] which consists in multiplying the 
matrix P rc in (*) by a random Gaussian vector. One can think of this as 
adding an extra edge (c, d) and using leaves a, b, d for the reconstruction, ex- 
cept that we do not need the transition matrix P cd to be stochastic and only 
one row of it suffices. More precisely, let U be a vector whose k entries are 
independent Gaussians with mean and variance 1. We solve the eigenvalue 
problem (*) with P™ replaced by T = (fj)^ =1 = P rc U, that is, we solve 

(*') P ab ' U (P ab )- l = P ar diag(T)(P ar y\ 

where P ab ' U is a matrix whose (i,j)th entry is 

Y / U^[s(c) = 1 ,s(b)=j\s(a) = i]. 

To see that (★') holds, multiply the following equation (in matrix form) to 
the right by (P" 6 )- 1 = (pr6)-i(par)-i. 

[P ar diag(T)P% = £ P%v h P$ 
hec 




_ \ " tt \ " par pre prb 

-yec h£C 

pab,U 

~ ij 

A different (independent) vector U is used for every triple of leaves consid- 
ered by the algorithm. Next we show that w.h.p. the entries of T = (t>i)f =1 
are l/poly(ra)-separated. 
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Lemma 4 (Eigenvalue separation). For all d > 8 and p < d — 8 , with 
probability at least 1 — n~ p , no two entries of T = (fj)^ =1 are at distance 
less than n~ d for all n large enough. 

Proof. By Lemma 2, | det[P rc ]| > n~ e . Take any two rows i,j of P rc . 
The matrix, say, A, whose entries are the same as P rc except that row i is 
replaced by P[. c — PJ. C , has the same determinant as P rc . Moreover, 

k k 

idet[yi]i<^ni^ CTW i<niiA.iii, 

<y h=\ h=i 
where the sum is over all permutations of {1, . . . , k}. Therefore, 

\\P[. C -Pjli >n~ e . 

By the Cauchy-Schwarz inequality, 

\\Pl c -P^\\ 2 >l/(Vkn e ). 

Therefore, (P[. c — Pj?)U is Gaussian with mean and variance at least 
l/(kn 2e ). A simple bound on the normal distribution gives 

1 Vkn 9 

There are 0(k 2 ) pairs of rows to which we apply the previous inequality. 
The union bound gives the result. □ 

Error on estimated l.h.s. On the l.h.s. of (*'), we use the following esti- 
mate P ab ' U = J^yec U^P^'" 1 . Below we show that the error on the l.h.s. of 
(*') is negligible w.h.p. 

Lemma 5 (Error on l.h.s.). For all e,p > 0, there is an s > such that 
if the number of samples is greater than n s , then, with probability at least 
1 — l/n p , the error on the l.h.s. of (*') satisfies 

iipab,Ufpab\—l _ pab,U tpab\ — 1 11 ^ jj_ 

~ n e 

for all n large enough. 

Proof. From the submultiplicativity of || • ||i, we obtain 

Upab,U ^pab^ — 1 pab,U ^pab^ — 1 1| ^ 

(3) < iip^HiiKP 06 )- 1 - (p^)- 1 ^ + iKP^y 1 ^]^' 17 - p ab > u \\i 

yab\~l ( TD<ib\ — In II r>ab,U TjabJJ I 



,d 



_|_ n^pab^ — 1 _ ^pab^ — 111 npab,U _ pa 



1- 
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so it suffices to prove that each term on the r.h.s. can be made small enough. 
First, note that, using a standard formula for the inverse, we have 

( 4 ) \(^% 1 \ = ^^\(^l P °"]h\<n e , 

where we have used the nonsingularity assumption, and the fact that the 
quantity adj[P or, ]y is the determinant of a substochastic matrix. Therefore, 
\\{P ab y% < kn e . 

A standard linear algebra result [20] gives 

1 1 / pab \ — 1 1 1 1 1 pab pab 1 1 



< 



1 - \\(p ab )- 1 \\l\\p ab - P ab \\l 

2k 2 n 2e 



rr 



where e' is the e from Lemma 3 and is taken larger than 6 so that the 
denominator on the first line is less than 1/2. 
We now compute the error on p ab > u . We have 



^pab,U _pab,U^ < \U^\\\P ahn -P afe ' 7 ||! < -^ll^lll- 

Also, 



7GC 



\pab,u^ x < {U^WP^IU < k\\U\\ v 

7SC 



By a simple bound (see, e.g., [9]), for g > 0, 



r- — ; — - 1 exp(-21og(n3)/2) 



2vr y/2\og{n9) n^ir log (n») ' 



So with probability at least 1 — k/ (n 9 y / 7rlog(nSj ) , we get 



||l7||i<fcy21og(nff). 
Taking s (and therefore e') large enough, the above bounds give 



^pab,U ^pab^ — 1 pab,U ^pab^ — 1 1 



1 



2fc 4 n 2 V21og(n3) £: 2 nV21og(r^) 2/c 3 ?i 2 V21og(n9) 

1 

<— ■ □ 

1l e 
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Separation of estimated eigenvalues. We need to make sure that the es- 
timated l.h.s. of (V) is diagonalizable. By bounding the variation of the 
eigenvalues and relying on the gap between the exact eigenvalues, we show 
that the eigenvalues remain distinct and, therefore, P ab ' U (p *)- 1 j s diago- 
nalizable. 

Lemma 6 (Sensitivity of eigenvalues). For all p > 0, there is an s > 
such that if the number of samples is greater than n s , then, with probabil- 
ity at least 1 — l/n p , the l.h.s. of (*'), p ab U (p ab ^- 1 ; { s diagonalizable and 
all its eigenvalues are real and distinct. In particular, all eigenspaces have 
dimension 1. 

Proof. Fix d = p + 9 in Lemma 4. A standard theorem on eigenvalue 
sensitivity [20] asserts that if Vj is an eigenvalue of P ab < u (p *)- 1 ; there is 
an eigenvalue Vi of P ab U suc ] 1 that (recall that P ar is the matrix of 
eigenvectors) 

\vj-Vi\ < \\P ar \\ 1 \\(P ar y 1 \\ 1 \\P ab ' U (P ab )~ 1 - P ab > u (p^y 1 ^ 
(5) fcV 1 

~ n e ~ 3n rf ' 

where e from Lemma 5 is taken large enough so that the last inequality 
holds. We have also used (4) from Lemma 5. Given that the separation 
between the entries of T is at least l/n d by Lemma 4, we deduce that there 
is a unique Vi at distance at most l/(3n d ) from vj (note that j might not 
be equal to i since the ordering might differ in both vectors). This is true 
for all j E.C. This implies that all Vj's are distinct and, therefore, they are 
real and P ab ' U (p ab ^-i - IS diagonalizable as claimed. □ 

Error on estimated eigenvectors. From (*'), we recover k eigenvectors 
that are defined up to scaling. Assume that, for all i 6 C, i>i is the estimated 
eigenvalue corresponding to Vi (see above). Denote by X % , X 1 their respec- 
tive eigenvectors. We denote X (resp. X) the matrix formed with the X l, s 
(resp. X l, s) as columns. Say we choose the estimated eigenvectors such that 
|| A" 4 ||i = 1. This is not exactly what we are after because we need the rows 
to sum to 1 (not the columns). To fix this, we then compute rj = A _1 l. 
This can be done because the columns of X form a basis. Then we define 
X 1 = rjiX 1 for all i with the corresponding matrix X. Our final estimate 
par = x is & rescaled version of X with row sums 1. The careful reader 
may have noticed that some entries of X may be negative. This is not an 
issue at this point. We will make sure in Lemma 12 that (one-step) mutation 
matrices are stochastic. Next we show that \\X — X\\\ is negligible w.h.p. 
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Lemma 7 (Sensitivity of eigenvectors). For all e,p > 0, there is an s > 
such that if the number of samples is greater than n s , then, with probability 
at least 1 — l/n p , we have 

1 



X-X !<-, 



for all n large enough. 



Proof. We want to bound the norm of X % — X % . We first argue about 
the components of X % — X 1 in the directions X 3 , j ^i. We follow a standard 
proof that can be found, for instance, in [17]. We need a more precise result 
than the one stated in the previous reference and so give the complete proof 
here. 

Because the X l, s form a basis, we can write 

jec 

for some values of pij's. Denote A = p ab -,u ^pab^-i ^ ^ _ ^. _ v . anc j g _ 

pab,U ^pab^ — 1 pab,U ^pab^ — l 'pj^g-Q 

(A + E)X* =ViX\ 
which, using AX % = ViX 1 , implies 

VjPijX^ + EX i = v l J2 PijX j + A,A'. 

For all j G C, let Z J be the left eigenvector corresponding to Vj. It is well 
known that {X^) T Z 3 ' = for all j + j' (see, e.g., [17]). Fix h ± i G C. Multi- 
plying both sides of the previous display by Z h and rearranging gives 

{Z h ) T {EX') + ^i{Z h ) T P 



Pih = 



(v h -Vi)(Z^Xh 



Here we make the X 1 be equal to the columns of P ar and the Z l, s equal 
to the columns of ((P ar ) T )~ 1 . In particular, we have (Z h ) T X h = 1. Recall 
that the X l, s were chosen such that ||A^||i = 1. Fix d = p + 9 in Lemma 4 
so that \vh — V{\> n~ d . Choose the value of e in Lemma 5 large enough so 
that the error \vi — v%\ in (5) (ref. proof of Lemma 6 where j is now i by 
the construction above) is less than l/n d+e for some fixed e 1 > (the d in 
Lemma 6 is d + e' here). Then using standard matrix norm inequalities, the 
Cauchy— Schwarz inequality and Lemmas 4, 5 and 6, we get 



\Pih\ 



{Z h ) T {EX i ) + ^ i {Z h ) T X 



{v h -Vi)(Z h ) T X h 
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< 



\Z h \ 



\E 



| 2 ||X*|| 2 + (l/n d+e, )||Z fa || 2 ||^|| 2 
l/n d 



< n d ||Z ft ||i||X < ||i[A/jfe||-B|| 1 + l/n d+e '} 
2kn e 1 

— „/ — 177) 

for some e" > 0, where we have used the bound ||.Z^||i < kn e which follows 
from (4) in Lemma 6 and the choice of Z h . We also used Lemma 5 to bound 
11-^11 1 — l/(y/kn d+e ) (which is possible if s is large enough; remember that 
A; is a constant). 

We now proceed to renormalize X appropriately. Define X 1 = X l /(1 + pa) . 
From the inequality above, we get 

i = n^iii< ii+Piiiii^iii+Ei^in xi iii 

< k\l + pu\ +k 2 /n e ". 

Assuming that n e " is large enough (i.e., choosing e' above large enough), we 
get 

k 2 /n e " 



\l + Pii\ > 



1 



k 



>0. 



Plugging Xi into the expansion of Xi , we get 



\X l - X 1 



Pij 



+ Pi 



-Xi 



< 



k 2 



< 



for some e'" > 0, where we have used {{X 3 \\i < k, j ^ i. 

Denote q = XI the row sums of X, the matrix formed with the X l, s 
as columns. The scaling between X and X is given by f) = X- l t. (Recall 
the definition of X from the paragraph above the statement of the lemma.) 
Indeed, because the columns of X and X are the same up to scaling, there is 
a vector fj such that X = X diag(fj). By the normalization of both matrices, 
we get 

Xfj = Xt = X diag(7?)l = Xfj. 

Because X is invertible, fj = fj. We want to argue that fj is close to 1, that 
is, that X and X are close. Note that 

Wx-^t-m^wx- 1 ] 



\i] - 1\ 



By the condition \\X l — X i \\i < 1 /n e for all i and the fact that the row sums 
of X are 1 , we get ||l-<z||i < k/n e "'. To bound let E = X-X and 
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note that, using a standard theorem on the sensitivity of the inverse [20], 
||^ _1 ||i< \\(X + E)' 1 - X' 1 + X- 1 ^ 



< 



1 1 - wx-^ixWEWi 

1 



IX" 1 



i - pr-iy.Eiii 

As we have seen before, ||A _1 ||i < kn e and by the bound above, < 
l/n e . Assuming that kn e /n e < 1/2, we get 

HJf" 1 !!! < 2||A~ 1 ||i <2kn e . 

Therefore, 

2k 2 n e 



This finally gives the bound 

|| A- X||i < || A- A||i + || A- A||i 

1 



< ||Adiag(r/) - A||i + 

< ||A||i||77-l||i + 



1 



< 



+ k 



ir 



tr 



2k 2 n 9 




pin 


+ - 


n e 


n 


1 


1 






pin — 


7 ' 



if e'" (i.e., e') is large enough (where e on the last line is the one in the 
statement of Lemma 7) . □ 

There is one last issue which is that A is the same as P ar up to permu- 
tation on the states of r. But since relabeling internal nodes does not affect 
the output distribution, we assume w.l.o.g. that P ar = X. We make sure 
in the next subsection that this relabeling is performed only once for each 
internal node. 



3.2. Bounding error propagation. The correctness of the algorithm pro- 
ceeds from the following remarks. 
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Partition. We have to check that the successive application of LeafRe- 
CON covers the entire tree, that is, that all edges are reconstructed. Figure 6 
helps in understanding why this is so. When uncovering a separator edge, we 
associate to it a new reference leaf at distance at most A. This can always 
be done by definition of A. It also guarantees that the subtree associated 
to this new leaf will cover the endpoint of the separator outside the subtree 
from which it originated. This makes the union of all subtrees explored at 
any point in the execution (together with their separators) connected. It 
follows easily that the entire tree is eventually covered. 

Lemma 8 (Partition). The successive application of LeafRecon covers 
the entire tree. 



Proof. We need to check that the algorithm outputs a transition matrix 
for each edge in T. Denote T at the subtree explored by LeafRecon applied 
to at- The key point is that, for all t, the tree T<t made of all T a , for t' < t, 
as well as their separators, is connected. We argue by induction. This is clear 
for t = 0. Assume this is true for t. Because T is a tree, T<t is a (connected) 
subtree of T and (wt+i,w' t+l ) is an edge on the "boundary" of T<t, the leaf 
a t+ i lies outside T< t . Moreover, being chosen as the closest leaf from w' t+1 , 
it is at distance at most A. Therefore, applying LeafRecon to at+i will 
cover a (connected) subtree including w' t+1 . This proves the claim. □ 

Subroutines. Using Lemma 7 and standard linear algebraic inequalities, 
we show that the (unnormalized) estimates computed in LeafRecon and 
SepRecon have negligible error w.h.p. 

Lemma 9 (Error analysis: LeafRecon). Let a be a leaf. For all e,p>0, 
there is an s > such that if the number of samples is greater than n s , 
then, with probability at least 1 — l/n p , all edges (ro,r) reconstructed by 
LeafRecon applied to a satisfy 



' \\i < — 



(after a proper relabeling of the rows and columns of P r ° r to match the 
labeling of P r ° r ), and also 

1 

7T r - 7IY 1 < — , 

(after a proper relabeling of the vertices) for all n large enough. 



Proof. Note that 

r'nrn 1 1 / -nam \ — 1 rtar r nam \ — 1 nar I, 

1) 



\pror _ pror^ _ \Wparo\—l par _ fparo\—lpar\ 
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and, thus, a calculation identical to the proof of Lemma 5 shows that the 

above error can be made negligible w.h.p. Also, 

11^- ^-M 1 1 ,s- r> ar ™- D ar 1 1 
lFr - TTr ||l = |Fa-r - TT a P ||i 

<< 1 1 £■ II II D ar T3 ar II l II,*- ,S- 1 1 II ~D ar 1 1 

s IfoIUIk ||i + Ifo - Toll ill" l|i) 

so, by Lemmas 3 and 5, ||7f r — 7r r ||i can be made negligible w.h.p. 

The algorithm computes the estimate P rr ° by Bayes' rule. Therefore, 



I prro 
I ij 



prro I 
ij 1 



3 l 



< 



TT r (i) 
Trp (j) jpT-pr 

7r r («) 



7T r (i) 



P 



r r I 



+ 



Tro(j) 7r ro (j) 



7f r (i) 7T r (i) 



pror 
ji ' 



Assume ||P r ° r - P 1 " ^! < n e ' and ||7f r - 7ry||i < n e " with e" > K n . Then, 



Tr () (j) 7T ro (j) 



7T r (i) 7T r (z) 



< 



< 



Tr(*)|T ro (j) — 7Tr (i)l +^r U)\^r(i) ~ 7T r (i)| 



7T r (i)7T r (i) 



2n~ 



Therefore, 



l^7°-n7°l< 



(n K,r — n e ") 2 
(1 + n~ e ")n~ 



ij 



+ 



2n" 



n ^ _ n c ^ n K,T T _ n e"y2 

The r.h.s. can be made negligible with a large enough sample size (i.e., large 
enough e',e"). □ 

Lemma 10 (Error analysis: SepRecon). For all e,p>0, there is an s > 
such that if the number of samples is greater than n s , then, with probability 
at least 1 — l/n p , every edge (w,w') reconstructed by SepRecon satisfies 

opww' pww' I <^ _ 

II Hi - n e' 

(after permuting the rows and columns of P ww ' to match the labeling of 
pww j Qr a n n i ar g e en0U gh. 

Proof. Let a, a' be the leaves used by SepRecon to estimate P ww ' . 
Then, 



P u 



(P 



aw\ — lpaa' / pw'a' \ — 1 



rpaw^ — lpaa' ^pw'a' ^ 



'a'\-l\ 



II- 



By applying twice an inequality of the form [3), the r.h.s. can be bounded 
above by a sum of terms involving primarily 1 1 P aw — P aw 1 1 1 , 1 1 P a w —P aw 
and ||P aa -P aa ||i. Those errors can be made negligible w.h.p. by Lemmas 3 
and 5. The algorithm then uses Bayes' rule to compute P w w . The error on 
that estimate can be obtained by a calculation identical to that in Lemma 9. 
□ 



26 



E. MOSSEL AND S. ROCH 



Consistency. Next we prove that all choices of labelings are done consis- 
tently. This follows from the fact that, for each node, say, w, the arbitrary 
labeling is performed only once. Afterward, all computations involving w 
use only the matrix P aw , where a is the reference leaf for w. 

Lemma 11 (Consistency). The labelings are made consistently by sub- 
routines LeafRecon and SepRecon. 

Proof. We briefly sketch the proof. By Lemma 7, we know that, for 
a reference leaf a and an internal node r, the estimated transition matrix 
P ar is close to the exact transition matrix P ar after properly relabeling the 
columns of P ar to match the arbitrary labeling of the columns of P ar . Let 
r r be the permutation matrix performing this relabeling on the columns of 
P ar , that is, such that ||P ar — P ar T r ||i is small. Let r ro be the corresponding 
matrix for node ro- Then, the matrix P r ° r (which, contrary to Lemma 9, we 
assume not to have been relabeled according to P r ° r ) satisfies the equation 

(p ar °r ro )- 1 p ar r r = r- 1 (P or °)- 1 P a, T r 

_ -pT prorp 

The last line is the matrix P r ° r after being properly relabeled according 
the arbitrary choices made by LeafRecon at nodes rg,r. By Lemmas 7 
and 9, this implies that ||P r ° r — r^ ) P r ° T T r ||i is small as required. A similar 

argument applies to the computation of 7f r , fr ro and P rr ° in LeafRecon, 
as well as the computation of p ww \ p^'w^ ^ an j in SepRecon. □ 

Stochasticity. It only remains to make the estimates of mutation matrices 
into stochastic matrices. Say P ww is the (unnormalized) estimate of P ww . 
First, some entries might be negative. Define P™ w to be the positive part 
of P ww . Then renormalize to get our final estimate 

( pww' \ . 

(6) pww' 



(Pi 



1 



for all igC. We know from Lemmas 9 and 10 that P ww is close to P ww 
in L\ distance. From this, we show below that Pf w is also close to P ww 
and ||(-Pf w )i.||i is close to 1 and that therefore — P TOW '|| 1 is negligible 

w.h.p. 

Lemma 12 (Stochasticity). For all e,p > 0, there is an s > such that 
if the number of samples is greater than n s , then, with probability at least 
1 — l/n p , the estimate P ww is well defined and satisfies 

1 
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(after permuting the rows and columns of P ww ' to match the labeling of 
pww'^ j Qr a ^ n i ar g e en0 ugh. 

Proof. Because P ww ' is nonnegative, taking the positive part of P ww ' 
can only make it closer to P ww , that is, 

Assume that, by Lemmas 9 and 10, we have the bound ^p ww — P ww \\i < 
n~ e . Then the row sums of P ww are at least 1 — kn~ e . Also, taking the 
positive part of P ww can only decrease its row sums by kn~ e . Therefore, 
we get 

, OA- 

\\{PDi.\\i>^-^- 

Thus, 



which can be made smaller than n~ e . □ 

Note that we do not renormalize the node distributions because we only 
need to know the distribution at one arbitrary node and that node can 
conveniently be chosen among the leaves. 

Precision and confidence. Now that all matrices have been approximately 
reconstructed, we prove that the distributions on the leaves of the estimated 
and real models are close. We show below that ||7T[ n ] — Traill l is negligible 
w.h.p., thereby proving Theorem 5. 

Lemma 13 (Precision and confidence). Let e, 5 > 0. Using at most poly(n, 
1/e, 1/5) samples, with probability at least 1 — 5, the reconstructed model sat- 
isfies 

\\ft[n] ~ n [n]\\l < £■ 

Proof. We only give a quick sketch. Assume that the number of sam- 
ples is taken large enough so that, by the sequence of lemmas above, the 
bounds on the L\ error on the estimated transition matrices and on the 
estimated node distributions, n~ e , is smaller than e/(2nk) with probability 
at least 1 — 5. By the triangle inequality, ||7T[ n ] — 7T[ n ]||i < 1 1 ttv — ttv 1 1 i , so 
it suffices to bound the L\ error on the entire tree. Now, couple the exact 
model and the estimated model in a standard fashion. We seek to bound the 
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probability that the two models differ at any vertex. Fix an arbitrary root. 
The probability that the models differ at the root is e/(2nk) by assumption. 
Stop if that happens. Otherwise, at each transition, the probability that the 
transition is different in the two models is less than e/(2n) (provided that 
they start from the same initial state). Again, if that happens, stop. Since 
there are at most 2n transitions, by the union bound, the probability that 
we stop at any step in the process is e. □ 

4. Concluding remarks. Many extensions of this work deserve further 
study: 

• There remains a gap between our positive result (for general trees), where 
we require determinants ( 1) and the hardness result which uses deter- 
minants exactly 0. Is learning possible when determinants are f2(n~ c ) or 
even f2(log _c n) (as it is in the case of HMMs)? 

• There is another gap arising from the upper bound on the determinants. 
Having mutation matrices with determinant 1 does not seem like a major 
issue. It does not arise in the estimation of the mutation matrices. But 
it is tricky to analyze rigorously how the determinant 1 edges affect the 
reconstruction of the topology. 

• We have emphasized the difference between k = 2 and k > 3. As it stands, 
our algorithm works only for nonsingular models even when k = 2. It 
would be interesting to rederive the results of [7] using our technique. 

APPENDIX: PROOFS FOR THE CATERPILLAR CASE 

We first sketch the proof for the topology reconstruction in the caterpillar 
case. 

Theorem 4. Let ( > 0, > and suppose that M consists of all ma- 
trices P satisfying < |detP| < 1. Then for all 6 > 0,r' > 0, and all 
TG (TC3 (g> M,n~ K7r ), one can recover from n°^ +e+T '^ samples a topology 
T with probability 1 — n~ d , where the topology T satisfies the following. 
It is obtained from the true topology T by contracting some of the internal 
edges whose corresponding mutation matrices P satisfy | detP| > 1 — n~ T . 

Proof (sketch). We use a distance-based method similar to [10, 11]. 
For general Markov models of evolution, Steel [31] introduced the follow- 
ing metric, known as log-det distance. Let V a b be the set of edges on the 
(unique) path between leaves a,b. Define the matrix F a b = [f a b(i,j)]i,j£C, 
where f a b(i,j) =F[s(a) =i,s(b) = j]. Then, Steel [31] showed that the quan- 
tity ^ab = — In I det(F a b) I defines a tree metric on the set of leaves by deriving 
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the identity ^ a b = J2(u,v)eP ab u uv, where 

u uv = - In | det(P uv )\ - \ In ( [] n u (i)) + \ ln( J[ n v (i)) . 

\iec / ViGC / 

Below, we will need a slightly different expression. Noting that P vu is the 
"time-reversal" of P UV J one immediately obtains 

v uv = -±la.\ det(P uv ) | - \ In | det (P vu ) \ . 

A crucial observation in [10, 11] is that, to obtain good estimates of dis- 
tances with a polynomial number of samples, one has to consider only pairs 
of leaves at a "short" distance. We note ^ ab the estimate of ^ ab . For A > 0, 
define 

Sa = {(a, b) G C x C : i> ab > 2A}. 

Let A = — ln[6n - ^]. Then it follows from [11], Proof of Theorem 14, that, 
for any e,p > 0, there exists an s > large enough so that, using n s samples, 
with probability at least 1 — n~ p , one has, for all (a, b) in S^a, 

\^ab ~ ^ab\ < ~ hi[l - n~ e ] < n~ e , 

and £2 a contains all pairs of leaves with ^ ab < 2A, but no pair with > 
6A. 

We now consider quartets of leaves at a short distance. Define 

%2A = {gG £ 4 : V(a,6) G q, (a,b) G 5 2 a}- 

We then use the four-point method to reconstruct quartets in Z2A- if Q is 
made of leaves a,b,c,d, then (w.l.o.g.) we infer the split {a,b}{c,d} , where 

^ab + ^cd < min{$ ac + # 6d , V ad + * 6c }. 

By [10], Lemma 5, this is guaranteed to return the right topology if, for all 
a',b' G q, 

\^a'b' -*a'6'| < |, 

where x is the length (in the log-det distance) of the internal edge in the 
subtree induced by q. In other words, if Q is the transition matrix on the 
internal edge of q, we can only reconstruct the topology of q if | det(Q)| is 
bounded away from 1. Therefore, we define a threshold 6 = — ln[l — n~ T ] for 
some t > and infer the topology only on those quartets in S2A such that 
(w.l.o.g.) 

^ab + *cd < min{* oc + * M , if ad + # 6c } - 25. 

If we further impose the restriction e > r, then for n large enough, all such 
quartets are correctly reconstructed w.h.p. 
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To contract small internal edges in a consistent fashion, we consider the 
following construction. Consider an undirected graph H with node set C. 
For two nodes a, b, there is an edge (a, b) if (a, b) £ *f> 2 A and if all recon- 
structed quartets put a and b on the same side of their corresponding split. 
Let {H a }a = i be the connected components of H. Ultimately, our inferred 
tree (say, T') will be such that all leaves in a same connected component 
of H form a star in T . To justify this approach, we make the following 
observations. Denote u (resp. v) the internal node of T closest to a (resp. 
b). If 

-±ln|det(P™)| < -ln[l - n~ T ] +ln[l-n" e ], 

then w.h.p. there exists an edge between a and b in H. Conversely, the 
existence of an edge (a, b) in H implies 

-\ In | det (P uv ) | < - ln[l - n~ T } - ln[l - n~ e ] , 

or 

\det(P uv )\ > [1 - n~ T ] 2 [l - n~ e } 2 . 

Because there are at most n nodes in each component of H, if a and b are 
in the same connected component, we have 

I det CP"" ) | > l-n~ T ', 

where r' < 2t — 1. Also, let a and b be two nodes in H connected by an edge 
and assume there is a leaf c in T between a and b (i.e., c is sticking out of 
the path between u and v). Let w be the internal node in T closest to c. 
Then we must have 

-ilnmin{|det(P nw )|,|det(P u,1 ')|} < ±[- ln[l - n~ T ] - ln[l - n~ e ]] 

<-ln[l-n- T ]+ln[l-ra~ e ], 

for n large enough and, therefore, we are guaranteed to have either (a, c) 
or (6, c) in w.h.p. All these observations imply that each connected com- 
ponent of H corresponds to a group of consecutive leaves with an internal 
path having a transition matrix close to identity. Therefore, we can assume 
that the connected components of H form stars in T . 

Finally, we choose a representative leaf l a from each connected component 
H a . Let T" be the subtree of T induced by l\, . . . , I a.- It suffices to estimate 
T" . Then the final estimate T is T", where all representative leaves are 
replaced by their corresponding star. 

The inference of T" is straightforward. Note first that if l a and In are 
leaves in T" and u and v are their respective closest internal nodes in T, 
then 

Vuv > ~ hi[l - n~ T ] + ln[l - n" e ]. 
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Also, by our choice of A and the construction of H, if l a and lp are two 
consecutive leaves in T", then there is at least one reconstructed quartet 
where l a and In are on different sides of the split, that is, each edge in the 
tree T" is represented by a split in the reconstructed quartets. To construct 
T" , we proceed by induction. We recall that a cherry in a tree is a pair of 
leaves whose topological distance is exactly 2. We first identify a cherry by 
finding a pair of leaves which is always on the same side of any split. This 
will be one of the two terminal cherries of T" . Then we remove one of the 
two leaves in that cherry, and start over. This way, we build the tree T" a 
leaf at a time from one end to the other end. □ 

Theorem 2. Let (fid,^ > be constants. Let C be a finite set and M 
denote the collection of \C\ x \C\ transition matrices P, where 1 > |detP| > 
n -<t>d _ Then there exists a PAC-learning algorithm for (TC3(n)(8iM ri ,Ti" K ' r ). 
The running time and sample complexity of the algorithm is poly(n, k, 1 /e, 1 /$) . 

Proof (sketch). From Theorem 4, we can infer a tree T which is ob- 
tained from the true topology T by contracting some of the internal edges 
whose corresponding mutation matrices P satisfy | detP| > 1 — n~ T (refer 
to the proof of Theorem 4 for notation). Now, note that the impossibility 
to infer (efficiently) quartets with a very small internal edge is of no con- 
sequence for the following reason. It is not hard to show that a stochastic 
matrix with a determinant close to 1 (in absolute value) is close to a permu- 
tation matrix. More precisely, for any r' > 0, there is an e' > such that if Q 
is a stochastic matrix with | det(Q)| > 1 — n~ r , then there is a permutation 
matrix J such that || J — Q\\i < n~ e (we omit the proof). Let E T i be the set 
of such transition matrices in our Markov model (only those corresponding 
to internal edges). By relabeling the states at the internal nodes, we can 
assume w.l.o.g. that all transition matrices on E T i are actually close to the 
identity matrix. Then if e' is large enough, any realization of the Markov 
model is such that there is no transition on edges in E T > w.h.p. Put differ- 
ently, from a PAC learning point of view, we can contract any edge in E T i . 
□ 
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